For our project, we built a robust scraper for Eventbrite events at a chosen location.
Below, we present and explain the structure of our scraper and finish with an analysis that shows its analytical potential.
Before you start, be a nice scraper and set your user agent. If you don’t know yours, a quick web search will tell you.
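# packages used by the scraper functions below
library(httr); library(xml2); library(jsonlite); library(tidyverse)
set_config(user_agent(""))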
#set_config(user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"))
Here, cities are matched with their url formatting on the Eventbrite
website. When running the function create_url (below) with
one of the city names, the matching ‘text’ value is inserted into the url, which gives you
a shortcut to each city’s correct url. Feel free to add more
cities.
load_cities_list <- function() {
cities_list <- tribble(~input, ~text,
"madrid", "spain--madrid",
"roma", "italy--roma",
"london", "united-kingdom--london",
"paris", "france--paris",
"barcelona", "spain--barcelona",
"san francisco", "ca--san-francisco",
"berlin", "germany--berlin",
"amsterdam", "netherlands--amsterdam",
"warsaw", "poland--warszawa",
"lyon", "france--lyon",
"marseille", "france--marseille",
"sevilla", "spain--sevilla",
"vigo", "spain--vigo",
"malaga", "spain--malaga",
"zaragoza", "spain--zaragoza",
"marbella", "spain--marbella",
"bilbao", "spain--bilbao",
"san sebastian", "spain--donostia-san-sebastián",
"cadiz", "spain--cádiz",
"granada", "spain--granada",
"valladolid", "spain--valladolid",
"santiago", "chile--santiago",
"buenos aires", "argentina--buenos-aires",
"sydney", "australia--sydney",
"lima", "peru--lima",
"glasgow", "united-kingdom--glasgow",
"dublin", "ireland--dublin",
"lisbon", "portugal--lisboa"
)
return(cities_list)
}
As explained above, we created this function so that you don’t have to manually enter or copy/paste the url from your browser for the cities above.
The function extracts the string stored in the text column of the
cities_list object, a data frame containing the input and the
url-ready text fragment we created in load_cities_list. It picks
the row whose ‘input’ matches the city name you pass in and returns
its url counterpart. Finally, the full url is assembled and can be stored in
an object for further use.
Currently, this function returns the page for tomorrow’s events, but you can change that in the code below, e.g. by using ‘today’ instead of ‘tomorrow’ for today’s events.
create_url <- function(city) {
cities_list <- load_cities_list() # loading city list into function
url_city <- cities_list$text[cities_list$input == city] # returns the city name in Eventbrite's url format
url_root <- "https://www.eventbrite.es/d/"
url_suffix <- "/events--tomorrow/" # you can change this to "/events--today/"
# url_suffix <- "/events--today/"
# combines the parts of the url to create full url
url <- paste0(url_root, url_city, url_suffix)
return(url)
}
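One caveat: if the city is not in the list, the lookup returns an empty vector and paste0() silently drops it, so you get a url without any city in it. A minimal sketch of a guard is shown below. The wrapper name create_url_checked and the shortened lookup table are assumptions for illustration only; they are not part of our scraper.

```r
# Shortened lookup table for illustration only (two of the cities from above)
cities <- data.frame(
  input = c("madrid", "berlin"),
  text  = c("spain--madrid", "germany--berlin")
)

# Hypothetical wrapper: same url logic as create_url(), plus an input check
create_url_checked <- function(city, lookup = cities) {
  key <- tolower(trimws(city))  # tolerate "Madrid" or " madrid "
  url_city <- lookup$text[lookup$input == key]
  if (length(url_city) != 1) {
    stop("Unknown city: '", city, "'. Add it to the city list first.")
  }
  paste0("https://www.eventbrite.es/d/", url_city, "/events--tomorrow/")
}

create_url_checked("Madrid")
```

Failing early like this is cheaper than discovering the problem only after the page count comes back empty.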
This function counts the number of result pages for the city’s events for tomorrow. It scrapes the maximum page number shown towards the bottom of the page.
This helps to estimate how many events there are and therefore how long the scraper will take. As a rule of thumb, the number of pages can be multiplied by 20, as each page (besides the last) contains 20 events.
We also use the resulting value for looping through pages. As the page number is part of the url and distinguishes each results page, it is easy to scrape each page separately.
count_pages <- function(link){
total_pages <- link |>
read_html() |>
xml_find_all("//li[@class='eds-pagination__navigation-minimal eds-l-mar-hor-3']") |>
xml_text() |>
str_extract_all("\\d+$") |>
pluck(1) # extracts first item
total_pages <- as.numeric(total_pages)
return(total_pages) # returns the number of available pages as a numeric value
}
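The rule of thumb can be turned into a quick estimate before you start a long run. The sketch below is only an approximation: the function name estimate_scrape is hypothetical, and it counts only the pauses used in our functions (2 seconds per results page, 1 second per event), so the real run time will be longer because download time is ignored.

```r
# Rough estimate based on the 20-events-per-page rule of thumb
# and the pauses between requests (download time is ignored)
estimate_scrape <- function(pages, events_per_page = 20,
                            pause_per_page = 2, pause_per_event = 1) {
  events  <- pages * events_per_page  # upper bound: the last page may hold fewer
  seconds <- pages * pause_per_page + events * pause_per_event
  list(events = events, minutes = seconds / 60)
}

estimate_scrape(8)
```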
This function returns all the event links for one results page, given by the page_num argument.
It appends the page number to the original link and then scrapes the links to the individual event pages. It returns up to 20 links as a character vector.
get_event_links <- function(link, page_num) {
link_with_page_num <- paste0(link, "?page=", page_num) # adds page number info to url
html_link <- link_with_page_num |>
read_html() |>
xml_find_all("//section[@class='event-card-details']//div[@class='Stack_root__1ksk7']//a") |>
xml_attr("href")
# remove the repeating links that are automatically scraped
links <- unique(html_link)
return(links) # returns character vector of links
}
This function returns all the event links for all pages for a city.
It combines count_pages() and
get_event_links(). It first gets the number of available
pages, which is then fed into a loop. It loops from page 1 to the
maximum number (the number of available pages) through the
get_event_links() function, appending the links to the
all_event_links vector. This vector is then filtered for duplicate
links and NA values.
The loop includes Sys.sleep(2) to keep our scraper from
behaving suspiciously.
get_all_event_links <- function(link) {
all_event_links <- c() # creates empty vector
pages_count <- count_pages(link = link) # gets number of available pages at this link
# pages_count <- min(10, pages_count) # don't pull more than 10 pages at once
for (i in seq_len(pages_count)) {
Sys.sleep(2)
# loops through number of pages
# gets a new batch of event links from page i
links <- get_event_links(link = link, page_num = i)
# adds it to the full vector of links
all_event_links <- c(all_event_links, links)
}
# returns only event links that are not NA and that are unique
return(unique(all_event_links[!is.na(all_event_links)]))
}
This function brings it all together: it collects the information for all events of a city into a data frame and exports it to a csv file.
Concerning the structure: the function relies on two inputs. First,
the link for the city page on Eventbrite, which is created with
create_url. Second, the city name, which must be passed
manually as a string (e.g. “berlin”). Beware that this string also
fills the column ‘City’ in the data frame and defines the name
of the csv file.
About the function: first, an empty data frame is created in
which the data will be stored. Then all event links of the city concerned
are stored in a character vector using our get_all_event_links()
function. Next, the file path for the csv file is built from
the city name string and the date, which is set to the system date + 1
as we are scraping tomorrow’s events. The file will be stored in your
current working directory, so check if that suits your requirements. If
you are working in a project, which we highly recommend, the file will
automatically be stored there.
To keep track of how far along we are, we created a percentage counter, for which we determine the number of links and set the starting point ‘j <- 0’.
Then, we start looping through all event links. In each iteration the counter increases by 1. Each event page is read only once at the beginning of the iteration rather than at every scraping step, which would strain our computer and the provider. The loop pauses for one second between iterations.
The parsed html is then passed to each individual scraping step, which stores its result in the previously created data frame. To avoid hiccups and error messages stopping the loop, we wrapped every step in tryCatch, so that the value stays NA if an error occurs. NA is also inserted into the data frame if the html does not hold valid information at the defined XPath. This is particularly important, as event pages on Eventbrite are closed once they are sold out. We account for this in our ‘Ticket_Type’ column, where we insert “Sold out” in case the XPath does not find anything.
For our last columns, we draw on information stored in JSON format within the html. Again, we wrap every step in tryCatch to prevent the scraper from stopping.
Finally, the progress is computed and printed by dividing the number of links looped through by the number of all available links.
The data is then written to a csv file: we first check whether the file exists, create it if not, and otherwise append each new row step by step.
get_event_info <- function(link, cityname) {
# data frame for the city with all events
shared_df <- data.frame()
# stores all the links we will scrape as character vector
all_links <- get_all_event_links(link)
# name of the CSV file that will be exported from shared_df
file_path <- paste0(cityname, format(Sys.Date()+1, "%Y%m%d"), ".csv")
num_all_links <- length(all_links) # number of links/events
j <- 0 # variable for counting links being scraped
# Loop over each link in 'all_links'
for (i in seq_along(all_links)) {
Sys.sleep(1)
j <- j + 1 # adds one each loop for counting progress
event_link <- all_links[i]
html <- tryCatch(read_html(event_link), error = function(e) NA) # NA if the page cannot be read
# Read HTML and extract duration if there is information, if not NA
tryCatch({shared_df[i, "Duration"] <- if (html |>
xml_find_all("//li[@class = 'eds-text-bm eds-text-weight--heavy css-1eys03p']/text()") |>
length() == 0) {
NA
} else {
html |>
xml_find_all("//li[@class = 'eds-text-bm eds-text-weight--heavy css-1eys03p']/text()") |>
xml_text() %>%
.[[1]]
}
}, error = function(e) {
shared_df[i, "Duration"] <- NA
})
# Extract ticket type
tryCatch({
shared_df[i, "Ticket_Type"] <- if (html |>
xml_find_all("//li[@class = 'eds-text-bm eds-text-weight--heavy css-1eys03p']/text()") |>
length() == 0) {
"Sold out"
} else {
html |>
xml_find_all("//li[@class = 'eds-text-bm eds-text-weight--heavy css-1eys03p']/text()") |>
xml_text() %>%
.[[2]]
}
}, error = function(e) {
shared_df[i, "Ticket_Type"] <- NA})
# Extract refund policy. If no info, store as NA
tryCatch({
shared_df[i, "Refund_Policy"] <- if (html |>
xml_find_all("//div[@class = 'Layout-module__module___2eUcs Layout-module__refundPolicy___fQ8I7']//section[@class = 'event-details__section']/div") |>
length() == 0) {
NA
} else {
html |>
xml_find_all("//div[@class = 'Layout-module__module___2eUcs Layout-module__refundPolicy___fQ8I7']//section[@class = 'event-details__section']/div") |>
xml_text() %>%
.[[2]]
}
}, error = function(e) {
shared_df[i, "Refund_Policy"] <- NA
})
# Extract description. If no info, save as NA
tryCatch({
shared_df[i, "Description"] <- if (html |>
xml_find_all("//div[@class = 'eds-text--left']//p") |>
length() == 0){
NA
} else {
html |>
xml_find_all("//div[@class = 'eds-text--left']//p") |>
xml_text() |>
discard(~.x == "") |>
paste(collapse = " ")
}
}, error = function(e) {
shared_df[i, "Description"] <- NA
})
# pull json data; `json` becomes NA if the script tag is missing or cannot be parsed
json <- tryCatch({
html |>
xml_find_all("//script[@type='application/ld+json']") |>
pluck(1) |>
xml_text() |>
fromJSON()
}, error = function(e) NA)
tryCatch({shared_df[i, "LowPrice"] <- json$offers$lowPrice[1]},
error = function(e) {
shared_df[i, "LowPrice"] <- NA})
tryCatch({shared_df[i, "HighPrice"] <- json$offers$highPrice[1]},
error = function(e) {
shared_df[i, "HighPrice"] <- NA})
tryCatch({shared_df[i, "Currency"] <- json$offers$priceCurrency[1]},
error = function(e) {
shared_df[i, "Currency"] <- NA})
tryCatch({shared_df[i, "Organizer"] <- json$organizer$name},
error = function(e) {
shared_df[i, "Organizer"] <- NA})
tryCatch({shared_df[i, "EventStatus"] <- json$eventStatus},
error = function(e) {
shared_df[i, "EventStatus"] <- NA})
tryCatch({shared_df[i, "StartTime"] <- lubridate::as_datetime(json$startDate)},
error = function(e) {
shared_df[i, "StartTime"] <- NA})
tryCatch({shared_df[i, "EndTime"] <- lubridate::as_datetime(json$endDate)},
error = function(e) {
shared_df[i, "EndTime"] <- NA})
tryCatch({shared_df[i, "Title"] <- json$name},
error = function(e) {
shared_df[i, "Title"] <- NA})
tryCatch({shared_df[i, "Subtitle"] <- json$description},
error = function(e) {
shared_df[i, "Subtitle"] <- NA})
tryCatch({shared_df[i, "url"] <- json$url},
error = function(e) {
shared_df[i, "url"] <- NA})
shared_df[i, "City"] <- cityname # city variable for when we compile data
# calculating and reporting the progress of the scraper
print(paste(round(100 * j/num_all_links, 2), "% completed"))
# read the CSV file with all the previously saved events and save as `res`
res <- try(read_csv(file_path, show_col_types = FALSE), silent = TRUE)
if (inherits(res, "try-error")) {
# Save the data frame we scraped above
print("File doesn't exist; Creating it")
write_csv(shared_df, file_path)
} else {
# If the file was read successfully, append the
# new rows and save the file again
combined_df <- bind_rows(res, shared_df[i,])
write_csv(combined_df, file_path)
}
rm(res) # remove `res` to save memory
}
}
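A note on the tryCatch pattern: an assignment made inside an error handler only changes a copy that is local to the handler function, so the handlers above do not write to shared_df themselves. The cell still ends up as NA because nothing was assigned to it. A more compact alternative is to let tryCatch() return the value and assign it once. The sketch below shows this idea with a hypothetical helper called safe_extract and a made-up list standing in for the parsed JSON; it is not the code we ran.

```r
# Hypothetical helper: evaluate an expression, return `default` on error or empty result
safe_extract <- function(expr, default = NA) {
  out <- tryCatch(expr, error = function(e) default)  # expr is evaluated lazily here
  if (length(out) == 0) default else out
}

# Stand-in for parsed JSON data (made-up values for illustration)
json <- list(name = "Sample event", offers = list(lowPrice = 10))

safe_extract(json$name)                   # field exists
safe_extract(json$organizer$name)         # field missing -> NA
safe_extract(stop("page not available"))  # error -> NA
```

Inside the loop, a line such as shared_df[i, "Title"] <- safe_extract(json$name) would then replace one full tryCatch block.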
Now it is your turn to try out our scraper. First, create the url for Madrid; create_url loads the city list for you.
l <- create_url("madrid")
How many pages are there for Madrid? Insert the link you just created and run the function. Recall that every page holds 20 events except for the last one.
For Madrid, for 24th March, 2024, there are 8 pages, so roughly 160 events. That will take some time...
count_pages(l)
## [1] 8
Now you can scrape all the event info - if you are patient.
get_event_info(l, "madrid")
We already gathered information on multiple cities on multiple days. You can find all the separate files in the folder data in our repository. Below, we show you a couple of ways to play with the data and get a sense of it. Of course, results highly depend on which cities you decide to scrape, but that is the whole idea, isn’t it?
As each city-date data set is stored in a separate csv file, the first
challenge is to read and combine all of them into one single
data frame. To do that with the least possible effort, we create a list
of all csv files in our data folder and loop through each one of them,
creating a list containing all data frames. Using
bind_rows() we create a single coherent data frame of all
data frames stored in that list.
# Setting the directory path where csv files are located
folder_path <- "data"
# Getting a list of csv files in the folder
csv_files <- list.files(folder_path, pattern = "\\.csv$", full.names = TRUE)
# Initializing an empty list to store data frames
all_data <- list()
# Looping through each csv file and read it into a data frame
for (file in csv_files) {
data <- read.csv(file)
# Append the data frame to the list
all_data[[length(all_data) + 1]] <- data
}
# Combine all data frames into a single object
combined_data <- bind_rows(all_data)
# Fixed some spelling mistakes in variable City
combined_data <- combined_data |>
mutate(City = case_when(
City == "buenos_aires" ~ "buenos aires",
City == "sevila" ~ "sevilla",
.default = City
))
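The read-and-combine loop can also be written more compactly with lapply(). The sketch below uses base R only and two temporary files with made-up rows standing in for our data folder; the extra source_file column is an assumption added for illustration and does not exist in our data.

```r
# Write two small example files into a temporary folder (made-up rows)
folder_path <- file.path(tempdir(), "eventbrite_demo")
dir.create(folder_path, showWarnings = FALSE)
write.csv(data.frame(Title = "A", City = "madrid"),
          file.path(folder_path, "madrid20240310.csv"), row.names = FALSE)
write.csv(data.frame(Title = c("B", "C"), City = "berlin"),
          file.path(folder_path, "berlin20240310.csv"), row.names = FALSE)

csv_files <- list.files(folder_path, pattern = "\\.csv$", full.names = TRUE)

# Read every file and keep the file name, which helps tracing odd rows later
all_data <- lapply(csv_files, function(f) {
  df <- read.csv(f)
  df$source_file <- basename(f)
  df
})

# rbind works here because all files share the same columns;
# dplyr::bind_rows() is more forgiving when columns differ
combined_demo <- do.call(rbind, all_data)
nrow(combined_demo)
```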
First, we are interested in how many events per day Eventbrite offers for each city. For that, we count the rows (events) per city and date and then average these daily counts per city. The results are plotted below.
Amsterdam and Sydney have, on average, the most events posted on Eventbrite. Note that we had some issues while compiling data for London, which we fixed but which are still reflected in London’s data, so expect the true average for London to be higher. In Spain, Madrid and Barcelona are in a neck-and-neck race, with Madrid having the edge by about 3.
combined_data$Date <- as.Date(combined_data$StartTime)
average_n_events <-
combined_data |>
group_by(City, Date) |>
summarise(daily_events = n()) |>
group_by(City) |>
summarise(average_daily_events = mean(daily_events))
## `summarise()` has grouped output by 'City'. You can override using the
## `.groups` argument.
ggplot(average_n_events, aes(x = City, y = average_daily_events)) +
geom_bar(stat = "identity", fill = "skyblue", color = "black") +
labs(title = "Average Number of Daily Events per City",
y ="",
x = "",
caption = "Data collected between 09.03.2024 and 14.03.2024")+
geom_text(aes(label = round(average_daily_events, 2)), vjust = -0.5, size = 2.2) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
While the average number of events may be interesting in terms of pure offering, one might want to know the thematic focus of events in a given city.
We use two analysis strategies to infer the main thematic focus of events in each city, each applied to both the titles and the descriptions.
First, we estimate the term frequency by city, which shows which words, and thereby topics, appear most often in the titles and descriptions.
Then, we determine the TF-IDF for each city for the titles and descriptions. The TF-IDF allows us to draw conclusions about the most unique events in a given city.
Starting off with the title, we first tokenize the title text of our main data frame and filter out stopwords from different languages, the city name itself and digits. Then, we estimate the term frequency for each city.
# getting stop words for English, Spanish, German, Dutch, French, Italian, and Portuguese
stopwords_en <- stopwords("en")
stopwords_es <- stopwords("es")
stopwords_de <- stopwords("de")
stopwords_nl <- stopwords("nl")
stopwords_fr <- stopwords("fr")
stopwords_it <- stopwords("it")
stopwords_pt <- stopwords("pt")
# combining stop words in all languages in one list
all_stopwords <- c(stopwords_en, stopwords_es, stopwords_de, stopwords_nl, stopwords_fr, stopwords_it, stopwords_pt)
# unnesting tokens to prepare for analysis
term_freq_title <-
combined_data |>
unnest_tokens(word, Title) |>
count(City, word, sort = TRUE)|>
filter(!word %in% all_stopwords) |> # filtering out all stop words
filter(!word %in% City)|> # filtering out the city name itself
filter(!grepl("^\\d+$", word)) # filtering out any singular numbers
# summing the total number of words by city and joining with term frequency data frame
total_words_title <- term_freq_title |>
group_by(City) |>
summarize(total = sum(n))
term_freq_title <- left_join(term_freq_title, total_words_title)
## Joining with `by = join_by(City)`
# calculating term frequency
term_freq_title <- term_freq_title |>
mutate(term_frequency = as.numeric(n/total))
To visualize our results, we only choose the 10 most frequent words per city.
Generally, events concerning ‘leadership’ seem to be offered often on Eventbrite in a lot of cities. Another frequent word is ‘free’, most likely referring to free city tours. Interestingly, other city names are mentioned often, like Bilbao and San Sebastian, or Leipzig in Berlin. Other widely popular activities are escape rooms, outdoor activities, and comedy shows. Events also seem to be rather similar across continents and cities regarding the most frequently offered ones. Also, English is the predominant language seen here, indicating that the most frequent types of events are targeted at an international audience.
top_terms_title <-
term_freq_title |>
group_by(City) |>
arrange(desc(term_frequency)) |>
slice(1:10) |> # Select the top 10 terms
ungroup()
ggplot(top_terms_title, aes(term_frequency, fct_reorder(word, term_frequency, .desc = FALSE), fill = City)) +
# Create the bars
geom_bar(position = "dodge", stat = "identity") +
# Plot settings
theme_minimal() +
theme(legend.position = "none") +
facet_wrap(~City, scales = "free_y", ncol = 3) +
labs(title = "Term Frequency of Event Titles",
x = "",
y = "",
caption = "Data collected between 09.03.2024 and 14.03.2024")
To determine the uniqueness of cities concerning their offered events, we calculate the TF-IDF. It weighs how frequent a word is in one group (here: city) against how many of the groups mention that word at all, so that words used everywhere score low.
Many unique terms refer to local places and their names. This is particularly apparent for Amsterdam. Since TF-IDF is language sensitive, national language words are more prominent in this plot (e.g. Valladolid “burgos”). It seems Madrid is the city of pub crawls ;-).
# calculating tf-idf
title_tf_idf <-
term_freq_title |>
bind_tf_idf(word, City, n) |>
select(-total)
# Plotting the 5 words with highest TF-IDF per city
title_tf_idf |>
group_by(City) |>
slice_max(tf_idf, n = 5, with_ties = FALSE) |>
ungroup() |>
ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = City)) +
theme_minimal() +
geom_col(show.legend = FALSE) +
facet_wrap(~City, ncol = 3, scales = "free") +
labs(title = "TF-IDF for Titles by City",
x = "TF-IDF",
y = NULL,
caption = "Data collected between 09.03.2024 and 14.03.2024")
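To make the numbers behind bind_tf_idf() more tangible, the calculation can be reproduced by hand. The sketch below uses base R and made-up word counts for three cities: tf is the share of a word within its city, and idf is the natural log of the number of cities divided by the number of cities in which the word appears.

```r
# Made-up word counts for three cities
counts <- data.frame(
  City = c("madrid", "madrid", "berlin", "berlin", "glasgow", "glasgow"),
  word = c("pub", "free", "abends", "free", "excel", "free"),
  n    = c(3, 1, 2, 2, 1, 3)
)

n_cities <- length(unique(counts$City))

# tf: share of a word among all words counted for its city
counts$tf <- counts$n / ave(counts$n, counts$City, FUN = sum)

# idf: log of (number of cities / number of cities in which the word appears)
cities_with_word <- ave(counts$n, counts$word, FUN = length)
counts$idf <- log(n_cities / cities_with_word)

counts$tf_idf <- counts$tf * counts$idf
counts[order(-counts$tf_idf), ]
```

In this toy data, ‘free’ appears in every city, so its idf and therefore its tf-idf is zero, while words that occur in only one city rise to the top.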
We now do the same for the event descriptions to verify our results.
Again, we find a lot of language targeting leadership courses (e.g. managers, leaders, management, career). Also, ‘thinking’ seems to be the most prominent word when describing your event. What do you think?
Again, words in the national language dominate. In Berlin, events tend to be in the evening (“abends”), while Glasgow is an Excel city. In London, a lot of events seemed to be around Francis Ngannou and his fight against Anthony Joshua. Generally, the descriptions are richer and more diverse, offering more insights into what is going on in each city.
Lastly, of course, it is also about the money. For that, we calculate
the average price of an event per city in Euros. Because different
countries use different currencies, we first standardize our
prices by writing a short price converter function and applying it to
both prices we scraped. LowPrice refers to the lowest price
at which an event can be accessed. Often, event organizers increase prices as
time goes on and/or demand increases. HighPrice then
captures the final, and highest, price at which an event may be accessed.
# Exchange rates (12.03.2024)
exchange_rates <- data.frame(
Currency = c("USD", "GBP", "EUR", "ARS", "SGD", "CAD", "CHF", "AUD", "NZD", "PLN"), # Currency codes
Exchange_Rate = c(0.91, 1.17, 1, 0.0011, 0.69, 0.68, 1.04, 0.60, 0.56, 0.23) # Corresponding exchange rates to Euros (EUR itself is 1)
)
# Merging exchange rate data with combined_data based on the Currency column
combined_data <- merge(combined_data, exchange_rates, by = "Currency", all.x = TRUE)
convert_to_euros <- function(price, currency, exchange_rate) {
if (is.na(currency)) {
return(NA) # Return NA if the currency is missing
} else if (currency == "EUR") {
return(price) # If the currency is already EUR, return the price as it is
} else {
return(price * exchange_rate) # Convert price to Euros using the exchange rate
}
}
# Adding the converted prices to the main dataframe
combined_data$Low_Price_Euros <- mapply(convert_to_euros, combined_data$LowPrice, combined_data$Currency,
combined_data$Exchange_Rate)
combined_data$High_Price_Euros <- mapply(convert_to_euros, combined_data$HighPrice, combined_data$Currency,
combined_data$Exchange_Rate)
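The same conversion can be done without mapply() by working on whole columns at once. The following is a sketch under assumptions: the function name convert_to_euros_vec is hypothetical, the prices are made up, and only the GBP and USD rates are taken from the table above.

```r
# Made-up prices; GBP and USD rates as in the exchange rate table
prices <- data.frame(
  LowPrice      = c(10, 20, 15, 5),
  Currency      = c("EUR", "GBP", NA, "USD"),
  Exchange_Rate = c(1, 1.17, NA, 0.91)
)

# Hypothetical vectorized variant of convert_to_euros()
convert_to_euros_vec <- function(price, currency, exchange_rate) {
  ifelse(is.na(currency), NA_real_,
         ifelse(currency == "EUR", price, price * exchange_rate))
}

prices$Low_Price_Euros <- convert_to_euros_vec(
  prices$LowPrice, prices$Currency, prices$Exchange_Rate
)
prices
```

As in the original function, a currency that is missing from the exchange rate table yields NA, because its exchange rate is NA after the merge.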
Having created this standardized price per event in Euros, we calculate and plot the average low and high price per city.
Concerning the former, Glasgow has the highest average price. Thinking back to what kind of events are offered, this is not very surprising, as Glasgow seemed to have a lot of educational events about technical skills, which tend to be longer and more expensive. Sydney and London follow. This may also be expected, since prices are generally higher in both cities.
average_low_price <-
combined_data |>
drop_na(Low_Price_Euros) |>
group_by(City) |>
summarise(average_price = mean(Low_Price_Euros))
ggplot(average_low_price, aes(x = City, y = average_price)) +
geom_bar(stat = "identity", fill = "skyblue", color = "black") +
geom_text(aes(label = paste0(round(average_price, 2), "€")),
vjust = -0.5, size = 2) +
labs(title = "Average Low Price per Event by City",
x = "City",
y = "") +
scale_y_continuous(labels = function(x) paste0(x, "€")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The graph below replicates the results for low prices but, as expected, with higher average prices. In Spain, Madrid ranges around the lowest price category.
average_high_price <-
combined_data |>
drop_na(High_Price_Euros) |>
group_by(City) |>
summarise(average_price = mean(High_Price_Euros))
ggplot(average_high_price, aes(x = City, y = average_price)) +
geom_bar(stat = "identity", fill = "skyblue", color = "black") +
geom_text(aes(label = paste0(round(average_price, 2), "€")),
vjust = -0.5, size = 2) +
labs(title = "Average High Price per Event by City",
x = "City",
y = "") +
scale_y_continuous(labels = function(x) paste0(x, "€")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
We discovered some problems with the time an event starts: when scraping the JSON information as we do, start and end times of events in other time zones are reported in our time zone. As such, we do not conduct an analysis of the start and end times of events in different cities.
To circumvent this problem, we started working on a rather complex pair of regex-based functions, which scrape the event time from the HTML and then try to standardize it. The functions work, but may need some fine-tuning and could be made more legible, which is why we have not included them in the scraper above.
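As an aside, a simpler route that we have not tested on real data might be to keep the local wall-clock time directly from the JSON timestamp. The sketch below assumes that startDate is an ISO 8601 string with a trailing UTC offset (e.g. "2024-03-24T19:00:00-07:00"); this format, the example value and the helper name local_clock_time are assumptions, not something we verified across events.

```r
# Hypothetical helper: keep the local wall-clock time of an ISO 8601 timestamp
# by parsing only the date-time part and ignoring the trailing UTC offset
local_clock_time <- function(iso_string) {
  stamp <- substr(iso_string, 1, 19)  # "YYYY-MM-DDTHH:MM:SS"
  parsed <- as.POSIXct(stamp, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
  format(parsed, "%H:%M")
}

# Made-up example: an event starting at 19:00 local time
local_clock_time("2024-03-24T19:00:00-07:00")
```

Because the offset is dropped rather than applied, the result is the time as shown to visitors at the venue, which is what a comparison of start times across cities would need.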
The basic idea of our two regex-based functions is the following:
We scrape the time information from the HTML and standardize it, because different events use different time encodings, each with very different formats.
pre_convert_time() scrapes the HTML for the time
information and then returns it in a standardized format: dd:dd - dd:dd.
The format may end in am or pm, which is important for the following
function.
convert_time_format() then takes that output and
returns, based on the input string, the start time for each event in
military time (dd:dd). This is done by adding 12 to the hour of each input time
string that ends with pm. It returns only the first time (the start
time), as for many events the organizer only provides the start time in
military time.
The following function takes the html of an event and prepares it for
convert_time_format().
filter_list <-
function(list) {
return(list[sapply(list, function(x) is.character(x) && nchar(x) > 0)])}
pre_convert_time <- function (html) {if (
html |>
xml_find_all("//div[@class = 'date-info']//span") |>
length() == 0) {
NA
}
else {
html |>
xml_find_all("//div[@class = 'date-info']//span") |>
xml_text() %>%
filter_list() %>%
str_extract_all("\\b(?:\\d+:\\d+(?:pm|PM|am|AM)?|(?:\\d+:)?\\d+pm|\\d+PM|\\d+am|\\d+AM|\\d+ - \\d+:\\d+(?:pm|PM|am|AM)?|\\d+ - \\d+(?:pm|PM|am|AM)?)") %>%
.[[1]] %>%
paste0(collapse = " - ") %>%
{if (str_detect(., "\\b\\d{1}(?![0-9]|:)(?:pm|PM|am|AM)")) { # Matches single-digit numbers not followed by another digit or colon, followed by "pm" or "am"
gsub("\\b(\\d{1})(pm|PM|am|AM)", "0\\1:00\\2", .)
} else if (str_detect(., "\\b(\\d{1}):?")) { # Matches singular numbers followed by a colon
gsub("\\b(\\d{1}):", "0\\1:", .)
} else if (str_detect(., "\\d{1} - \\d{1}(?:pm|PM|am|AM)?")){
gsub("(\\d{1,2}) - (\\d{1,2})(pm|PM|am|AM)", "0\\1:00 - 0\\2:00\\3", .)
} else if (str_detect(., "\\d{1,2}(?:pm|PM|am|AM)?")){
if_else(str_detect(., "pm|PM"), gsub("(\\d{1,2})(pm|PM|am|AM)?", "0\\1:00 - 00:00pm", .),
gsub("(\\d{1,2})(pm|PM|am|AM)?", "0\\1:00 - 00:00", .))
} else if (str_detect(., "\\d{1}")) {
gsub("(\\d{1})", "0\\1:00", .)
} else {
paste(.)
}
} %>%
gsub("\\b(\\d)( - |$)", "0\\1:00\\2", .)%>%
gsub("^\\b(\\d{2}:\\d{2})(pm|am|PM|AM)$", "\\1 - 00:00\\2", .) %>%
gsub("^(\\d{2}:\\d{2})$", "\\1 - 00:00\\2", .)%>%
gsub("(?<=\\s\\d{2})(pm|am|PM|AM)\\b", ":00\\1", ., perl = TRUE)%>%
gsub("(\\d{1,2})(am|pm|AM|PM)\\s", "\\1:00\\2 ", ., perl = TRUE) %>%
gsub("\\b(\\d):(\\d{2}(am|pm|AM|PM)?)\\b", "0\\1:\\2", .) %>%
gsub("(?<!\\d:)(\\b\\d{2})\\b\\s", "\\1:00 ", ., perl = TRUE) %>%
gsub("(pm|am) - ", " - ", .)%>%
gsub("(am|AM)$", "pm", .)
}
}
This function converts a time string into a format usable for the final data table.
convert_time_format <- function(time_string) {
# Missing input stays missing
if (is.na(time_string)) {
return(NA)
} else {
# Check if "pm" or "PM" is present in the input string
if(str_detect(time_string, "pm|PM")) {
# Extract hours and minutes from the input string
time_parts <- str_match(time_string, "(\\d+):(\\d+) - (\\d+):(\\d+)(pm)?")
start_hour <- as.integer(time_parts[2])
start_minute <- as.integer(time_parts[3])
end_hour <- as.integer(time_parts[4])
end_minute <- as.integer(time_parts[5])
# Convert to 24-hour format
if (start_hour != 12) {
start_hour <- start_hour + 12
}
# Return the start time string in the desired format
result <- sprintf("%02d:%02d", start_hour, start_minute)
} else {
start_time <- str_match(time_string, "(\\d+):(\\d+) - (\\d+):(\\d+)")
start_hour <- as.integer(start_time[2])
start_minute <- as.integer(start_time[3])
# No "pm" or "PM" present: keep the start time as it is, zero-padded
time_string <- sprintf("%02d:%02d", start_hour, start_minute)
result <- time_string
}
return(result)
}
}
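One edge case is worth noting: when no pm is found, convert_time_format() passes the hour through unchanged, so a time such as 12:00am would stay 12:00 instead of becoming 00:00. The sketch below shows the conversion arithmetic with that case covered. The helper name to_24h and its three separate arguments are assumptions for illustration; it does not parse the strings produced by pre_convert_time().

```r
# Hypothetical helper: convert a 12-hour clock time to "HH:MM" in 24-hour format
to_24h <- function(hour, minute, meridiem) {
  meridiem <- tolower(meridiem)
  hour <- hour %% 12  # 12 becomes 0; pm hours are shifted below
  if (meridiem == "pm") hour <- hour + 12
  sprintf("%02d:%02d", as.integer(hour), as.integer(minute))
}

to_24h(7, 30, "pm")   # "19:30"
to_24h(12, 0, "am")   # "00:00"
to_24h(12, 15, "PM")  # "12:15"
```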